Group members: Kaushal Chaudhary, Gauresh Chavan, Mohit Ruke
ABSTRACT
This research project applies three machine learning (ML) algorithms: Logistic Regression to the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, an Extra Trees Classifier to the Wisconsin Prognostic Breast Cancer (WPBCC) dataset, and a Support Vector Classifier (SVC) to the WPBC dataset, and compares their classification test accuracy, sensitivity, and specificity.
INTRODUCTION
Breast cancer is the most common form of cancer in women, affecting almost 12% of women across the world. In recent years the incidence rate has kept increasing, and data show that the survival rate is 88% five years after diagnosis and 80% ten years after diagnosis. Early prediction of cancer is one of the most crucial steps in the follow-up process [3]. Over the past few decades, scientists have applied different methods, such as early-stage screening, so that types of cancer can be identified before their symptoms appear. With the advent of new technologies, large amounts of cancer data are available to the research community. However, accurate prediction of a disease outcome remains one of the most challenging tasks for researchers and physicians around the world [2]. In this paper, we develop methods, using machine learning techniques, that allow accurate prognosis of cancer.
Today, despite the many advances in early detection of diseases, cancer patients still have a poor prognosis and low survival rates [1]. In cancer prediction/prognosis one is concerned with the following [4]:
1) classification of tumor type
2) prediction of cancer recurrence and
3) prediction of cancer survivability.
In the first case, one is trying to predict the type of tumor (malignant or benign) prior to the occurrence of the disease. In the second case, one is trying to predict the likelihood of redeveloping cancer. In the third case, one is trying to predict an outcome (life expectancy, survivability, progression, tumor-drug sensitivity) after the diagnosis of the disease. In the latter two situations, the success of the prognostic prediction is obviously dependent on the success or quality of the diagnosis [4].
DATA SOURCES
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Reading data from CSV
cancer = pd.read_csv("cancer.csv")
cancer.head()
cancer.describe()
cancer.shape
cancer.columns
The diagnosis column in our data contains string values, which need to be converted into binary integer values for our logistic model to process: if the tumor is malignant, result = 1, else 0.
def converter(result):
    if result == 'M':
        return 1
    else:
        return 0
cancer['result'] = cancer['diagnosis'].apply(converter)
cancer.head()
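As a side note (a sketch, not part of the original notebook), the same conversion can be written without a helper function using pandas' `Series.map` with an explicit dictionary; any unexpected label then surfaces as NaN instead of silently becoming 0.

```python
import pandas as pd

# Hypothetical miniature of the diagnosis column, for illustration only.
demo = pd.DataFrame({'diagnosis': ['M', 'B', 'M', 'B']})

# Series.map with an explicit dictionary replaces the converter() helper.
demo['result'] = demo['diagnosis'].map({'M': 1, 'B': 0})
print(demo['result'].tolist())  # -> [1, 0, 1, 0]
```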
from sklearn.model_selection import train_test_split
X = cancer[['radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst']]
y = cancer['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Credits: Randal Olson, TPOT
TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for our data. Once TPOT has finished searching, it provides the Python code for the best pipeline it found, so we can tinker with that pipeline from there.
from tpot import TPOTClassifier
from sklearn.feature_selection import RFE
tpot = TPOTClassifier(generations=10, population_size=10, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
X = cancer[['radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst']]
y = cancer['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
log_reg = linear_model.LogisticRegressionCV()
log_reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(log_reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg.score(X_test, y_test)))
Regularization
We will use the L2 (ridge) regularization that comes by default with LogisticRegression() from scikit-learn. We will vary the C value, a parameter that controls the strength of regularization (a larger C means weaker regularization), and see whether regularization helps our model.
log_reg100 = LogisticRegression(C=100)
log_reg100.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(log_reg100.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg100.score(X_test, y_test)))
log_reg001 = LogisticRegression(C=0.01)
log_reg001.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(log_reg001.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg001.score(X_test, y_test)))
plt.figure(figsize=(10,6))
plt.plot(log_reg.coef_.T, 'o', label='CV-selected C')
plt.plot(log_reg100.coef_.T, '^', label='C=100')
plt.plot(log_reg001.coef_.T, 'v', label='C=0.01')
plt.xticks(range(X.shape[1]), X.columns, rotation=90)
plt.hlines(0,0, X.shape[1])
plt.ylim(-5,5)
plt.xlabel('Index')
plt.ylabel('Coefficient Magnitude')
plt.legend()
print ("---Logistic Model---")
log_roc_auc001 = roc_auc_score(y_test, log_reg001.predict(X_test))
print ("Logistic_001 AUC: ", log_roc_auc001 )
print(classification_report(y_test,log_reg001.predict(X_test)))
print ("---Logistic Model---")
log_roc_auc100 = roc_auc_score(y_test, log_reg100.predict(X_test))
print ("Logistic_100 AUC: ", log_roc_auc100 )
print(classification_report(y_test,log_reg100.predict(X_test)))
print ("---Logistic Model---")
log_roc_auc = roc_auc_score(y_test, log_reg.predict(X_test))
print ("Logistic AUC: ", log_roc_auc )
print(classification_report(y_test,log_reg.predict(X_test)))
Evaluation: Plotting ROC curve
What is ROC?
The ROC curve is a fundamental tool for diagnostic test evaluation.
In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal).
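AUC also has a ranking interpretation: it equals the probability that a randomly chosen diseased case scores higher than a randomly chosen normal one, with ties counting half. A pure-Python sketch with made-up scores illustrates this:

```python
# Hypothetical classifier scores, for illustration: higher = more likely diseased.
pos_scores = [0.9, 0.8, 0.35]  # truly diseased cases
neg_scores = [0.7, 0.3, 0.2]   # truly normal cases

# AUC = fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
wins = 0.0
for p in pos_scores:
    for n in neg_scores:
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
auc = wins / (len(pos_scores) * len(neg_scores))
print(round(auc, 3))  # -> 0.889
```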
fpr, tpr, thresholds = roc_curve(y_test, log_reg.predict_proba(X_test)[:,1])
fpr100, tpr100, thresholds100 = roc_curve(y_test, log_reg100.predict_proba(X_test)[:,1])
fpr001, tpr001, thresholds001 = roc_curve(y_test, log_reg001.predict_proba(X_test)[:,1])
plt.figure(figsize = (12,8))
plt.plot(fpr, tpr, label ="Log_reg (area = %0.2f)" % log_roc_auc, color ="red")
plt.plot(fpr100, tpr100, label ="log_reg100 (area = %0.2f)" % log_roc_auc100, color ="blue")
plt.plot(fpr001, tpr001, label ="log_reg001 (area = %0.2f)" % log_roc_auc001, color ="green")
plt.plot([0,1],[0,1], 'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
From the figure above, it is clear that a larger 'C' parameter (weaker regularization) gives better accuracy on this dataset.
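The inverse relationship between C and regularization strength can also be seen directly in the coefficients: shrinking C shrinks their magnitudes. A sketch on synthetic data from `make_classification` (an assumption standing in for the WDBC features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the tumor features, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Smaller C = stronger L2 penalty = smaller coefficient magnitudes.
weak_reg = LogisticRegression(C=100, max_iter=1000).fit(X_demo, y_demo)
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X_demo, y_demo)
print(np.abs(weak_reg.coef_).sum() > np.abs(strong_reg.coef_).sum())  # -> True
```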
Dataset : Wisconsin Prognostic Breast Cancer Dataset
Attribute Information
1) ID number
2) Outcome (R = recur, N = nonrecur)
3) Time (recurrence time if field 2 = R, disease-free time if
field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
Importing useful packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
## importing the dataset
dataset = pd.read_csv("wpbc_data.csv")
dataset.head(10)
# Replace missing '?' entries in 'Lymph node status' with 0 and cast to int
a=dataset['Lymph node status'].replace('?',0)
b=a.astype(int)
# Drop the original column; the cleaned copy is concatenated back in later
data=dataset.drop('Lymph node status',axis=1)
data.head()
Converting the categorical variable into binary values for analysis.
def converter(result):
    if result == 'R':
        return 1
    else:
        return 0
The dataset is filtered to reflect a particular endpoint; e.g., recurrence before 24 months = positive (1), nonrecurrence beyond 24 months = negative (0).
def converter_1(v1, v2):
    # Zero out the disease-free time for nonrecurrent cases (v1 == 0) that
    # stayed disease-free beyond 24 months; all other times pass through.
    if v1 == 0 and v2 > 24:
        return 0
    return v2
data['recurance'] = dataset['recurance'].apply(converter)
data['time']=data.apply(lambda data:converter_1(data['recurance'],data['time']),axis=1)
df=pd.concat([data,b],axis=1)
df.head(15)
df.columns
plt.figure(figsize = (30,30))
sns.heatmap(df.corr(),cmap="rainbow",annot=True)
Categorizing the data into three sections for plotting purposes.
pp1=data[['time', 'radius_mean', 'texture_mean',
'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean',
'concavity_mean', 'concave points_mean', 'symmetry_mean',
'fractal_dimension_mean','recurance']]
pp2=data[['time','radius_se', 'texture_se', 'perimeter_se',
'area_se', 'smoothness_se', 'compactness_se', 'concavity_se',
'concave points_se', 'symmetry_se', 'fractal_dimension_se','recurance']]
pp3=data[['time','radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
'smoothness_worst', 'compactness_worst', 'concavity_worst',
'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst','recurance']]
sns.pairplot(pp1)
plt.title('Mean')
sns.pairplot(pp2)
plt.title('Standard Error')
sns.pairplot(pp3)
plt.title('Worst')
We have 33 features in our dataset. Our goal is to determine, as accurately as possible, the recurrence or nonrecurrence of the cancer from the parameters of the tumor. If we select all the features, we might end up overfitting the model, which we want to avoid. Using Recursive Feature Elimination (RFE), we will keep only the 10 best features, which the model itself determines, and fit the model on those. The target variable has two outcomes, 'R' for recurrence and 'N' for nonrecurrence. Since the target variable is a string, we need to convert its values to binary. In the 'time' feature, nonrecurrence values greater than 24 months suggest that recurrence of the cancer is unlikely, so we converted those values to 0, preparing the dataset for further processing.
X = df[['time', 'radius_mean', 'texture_mean',
'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean',
'concavity_mean', 'concave points_mean', 'symmetry_mean',
'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se',
'area_se', 'smoothness_se', 'compactness_se', 'concavity_se',
'concave points_se', 'symmetry_se', 'fractal_dimension_se',
'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
'smoothness_worst', 'compactness_worst', 'concavity_worst',
'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst',
'Tumor size','Lymph node status']]
Y = df['recurance']
print(X.shape,'X')
print(Y.shape,'Y')
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
from tpot import TPOTClassifier
from sklearn.feature_selection import RFE
tpot = TPOTClassifier(generations=10, population_size=10, verbosity=2)
tpot.fit(X_train, Y_train)
print(tpot.score(X_test, Y_test))
Here, the best pipeline found for this dataset is a LinearSVC model, which may give an accuracy of 82.5 percent with properly tuned parameters.
The objective of a Linear SVC (Support Vector Classifier) is to fit the data you provide, returning a "best fit" hyperplane that divides, or categorizes, the data. After obtaining the hyperplane, you can feed features to the classifier to see what the predicted class is. This makes the algorithm well suited to our use case, though it can be applied in many situations.
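Concretely, LinearSVC's `decision_function` returns the signed distance of each sample to the fitted hyperplane, and `predict` is simply its sign. A sketch on synthetic data (an assumption, not the WPBC features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
svc_demo = LinearSVC(dual=False).fit(X_demo, y_demo)

# decision_function gives w.x + b; a positive score means the positive class.
scores = svc_demo.decision_function(X_demo)
agree = ((scores > 0).astype(int) == svc_demo.predict(X_demo)).all()
print(agree)  # -> True
```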
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,roc_auc_score,roc_curve,auc
X = df[['time', 'radius_mean', 'texture_mean',
'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean',
'concavity_mean', 'concave points_mean', 'symmetry_mean',
'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se',
'area_se', 'smoothness_se', 'compactness_se', 'concavity_se',
'concave points_se', 'symmetry_se', 'fractal_dimension_se',
'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
'smoothness_worst', 'compactness_worst', 'concavity_worst',
'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst',
'Tumor size','Lymph node status']]
Y = df['recurance']
print(X.shape,'X')
print(Y.shape,'Y')
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = .3, random_state=0)
svc = LinearSVC(C=25.0, dual=True, loss='squared_hinge', penalty='l2', tol=1e-05)
svc.fit(X_train, y_train)
print('The accuracy on the training subset: {:.3f}'.format(svc.score(X_train, y_train)))
print('The accuracy on the test subset: {:.3f}'.format(svc.score(X_test, y_test)))
predictions = svc.predict(X_test)
print(confusion_matrix(y_test,predictions))
True Positives: 40
True Negatives: 11
print(classification_report(y_test,predictions))
From the above confusion matrix, we can infer that our current model predicts true positives and true negatives reasonably well; since we got an accuracy of 81%, there are still some false positives and false negatives, which we want to reduce by selecting the important features and tuning the model further. More about RFE
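Since the abstract reports sensitivity and specificity, it is worth spelling out how they fall out of a confusion matrix. A pure-Python sketch with illustrative counts (the false positive/negative counts here are made up, not taken from the matrix above):

```python
# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 40, 11, 3, 4

sensitivity = tp / (tp + fn)  # true positive rate, a.k.a. recall
specificity = tn / (tn + fp)  # true negative rate
print(round(sensitivity, 3), round(specificity, 3))  # -> 0.909 0.786
```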
rfe = RFE(svc, n_features_to_select=10)
rfe.fit(X,Y)
rfe.ranking_
S=rfe.transform(X)
Y=df['recurance']
X_train, X_test, y_train, y_test = train_test_split(S, Y, test_size=0.3, random_state=101)
svc =LinearSVC(C=25.0, dual=False, loss='squared_hinge', penalty='l2', tol=1e-05)
svc.fit(X_train, y_train)
print('The accuracy on the training subset: {:.3f}'.format(svc.score(X_train, y_train)))
print('The accuracy on the test subset: {:.3f}'.format(svc.score(X_test, y_test)))
NEED FOR SCALING...
The following plot shows the difference between the minimum and maximum magnitudes of the selected features.
plt.figure(figsize=(10,5))
plt.plot(X_train.min(axis=0), 'o', label='Minimum')
plt.plot(X_train.max(axis=0), 'v', label='Maximum')
#plt.xticks(range(X.shape[1]), X, rotation=90)
#plt.hlines(0,0, X.shape[1])
#plt.ylim(-5,5)
plt.xlabel('Feature Index')
plt.ylabel('Feature Magnitude in Log Scale')
plt.yscale('log')
plt.legend(loc='upper right')
plt.xticks(rotation=90)
plt.show()
Scaling Train data...
#Finding the minimum values for each feature
min_train = X_train.min(axis=0)
#Finding the range of each feature
range_train = (X_train - min_train).max(axis=0)
#Scaling the features into the range 0 to 1
X_train_scaled = (X_train - min_train)/range_train
print('Minimum per feature\n{}'.format(X_train_scaled.min(axis=0)))
print('Maximum per feature\n{}'.format(X_train_scaled.max(axis=0)))
Scaling Test Data...
X_test_scaled = (X_test - min_train)/range_train
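Scikit-learn's `MinMaxScaler` encapsulates this same fit-on-train, transform-both pattern. The sketch below (on synthetic data, as an assumption) checks that it matches the manual min/range arithmetic above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
train_demo = rng.uniform(0, 100, size=(50, 4))  # stand-in training features
test_demo = rng.uniform(0, 100, size=(20, 4))   # stand-in test features

# Manual scaling: subtract the training minimum, divide by the training range.
mins = train_demo.min(axis=0)
ranges = (train_demo - mins).max(axis=0)
manual_train = (train_demo - mins) / ranges
manual_test = (test_demo - mins) / ranges

# MinMaxScaler learns the same minimum and range from the training data only.
scaler = MinMaxScaler().fit(train_demo)
print(np.allclose(manual_train, scaler.transform(train_demo)),
      np.allclose(manual_test, scaler.transform(test_demo)))  # -> True True
```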
svc1 = LinearSVC(C=25.0, dual=False, loss='squared_hinge', penalty='l2', tol=1e-05)
svc1.fit(X_train_scaled, y_train)
print('The accuracy on the training subset: {:.3f}'.format(svc1.score(X_train_scaled, y_train)))
print('The accuracy on the test subset: {:.3f}'.format(svc1.score(X_test_scaled, y_test)))
prediction = svc1.predict(X_test_scaled)
print(confusion_matrix(y_test,prediction))
We can see that we have succeeded in increasing the true positives while decreasing the false negatives.
print(classification_report(y_test,prediction))
GridSearchCV performs an exhaustive search over specified parameter values for an estimator. It implements "fit" and "predict" methods like any classifier, except that the parameters of the classifier used to predict are optimized by cross-validation.
param_grid = {'C': [0.1,1, 10, 100, 1000]}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(LinearSVC(C=25.0, dual=False, loss='squared_hinge', penalty='l2', tol=1e-05),param_grid,refit=True,verbose=3)
#Fitting our model with the selected hyper-parameters and classifier
grid.fit(X_train_scaled,y_train)
#Finding the best hyper-parameters for our model depending on the dataset
grid.best_params_
grid.best_estimator_
g_prediction = grid.predict(X_test_scaled)
print(confusion_matrix(y_test,g_prediction))
The number of true positives have increased and false negatives have decreased which is a good sign.
print(classification_report(y_test,g_prediction))
We perform the evaluation by plotting the ROC curves and comparing their AUC values.
log_roc_auc3 = roc_auc_score(y_test, grid.predict(X_test_scaled))
print ("GRID AUC: ", log_roc_auc3 )
log_roc_auc2 = roc_auc_score(y_test, svc.predict(X_test))
print ("Unscaled SVM AUC: ", log_roc_auc2 )
log_roc_auc1 = roc_auc_score(y_test, svc1.predict(X_test_scaled))
print ("Scaled SVM AUC: ", log_roc_auc1 )
fpr3, tpr3, thresholds3 = roc_curve(y_test, grid.predict(X_test_scaled))
fpr2, tpr2, thresholds2 = roc_curve(y_test, svc.predict(X_test))
fpr1, tpr1, thresholds1 = roc_curve(y_test, svc1.predict(X_test_scaled))
plt.figure(figsize = (10,8))
plt.plot(fpr3, tpr3, label ="GRID (area = %0.2f)" % log_roc_auc3, color ="blue")
plt.plot(fpr2, tpr2, label ="Unscaled SVM(area = %0.2f)" % log_roc_auc2, color ="green")
plt.plot(fpr1, tpr1, label ="Scaled SVM(area = %0.2f)" % log_roc_auc1, color ="orange")
plt.plot([0,1],[0,1], 'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Receiver operating characteristic - SVM')
plt.legend(loc="lower right")
plt.show()
From the plot, we can infer that GRID gives us the best accuracy.
Dataset : Wisconsin Prognostic Breast Cancer (WPBCC) Dataset
Glossary: This file contains patients' nuclear features, survival time and chemotherapy information. The 39 columns contain the following information:
CODE_A : = 1 if recur, = 0 otherwise
CODE_B : = 1 if death is caused by cancer, = 0 otherwise
TTR : Time To Recur (if CODE_A = 1)
DFS : Disease Free Survival (if CODE_A = 0)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from time import time
%matplotlib inline
## importing the dataset
dataset = pd.read_csv("WPBCC.csv")
dataset.head()
data = dataset[['PATIENT','CODE_B', 'CODE_A', 'TIME_A', 'TIME_B', 'RADIUS', 'TEXTURE',
'PERIMETR', 'AREA', 'SMOOTH', 'COMPCT', 'CONCV', 'CONV_PT', 'SYMM',
'FRACT_D', 'SRADIUS', 'STEXTURE', 'SPERIMET', 'SAREA', 'SSMOOTH',
'SCOMPCT', 'SCONCV', 'SCONV_PT', 'SSYMM', 'SFRACT_D', 'WRADIUS',
'WTEXTURE', 'WPERIMET', 'WAREA', 'WSMOOTH', 'WCOMPCT', 'WCONCV',
'WCONV_PT', 'WSYMM', 'WFRACT_D', 'SIZE', 'NODE_ALL', 'CHEMO', 'HORMO']]
data.head()
## X holds the variables used for prediction; Y is the target
X = data.drop(['PATIENT','CODE_B'],axis=1)
Y = data['CODE_B']
X.head()
## Splitting the data into training and testing datasets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=10, population_size=50, verbosity=2)
tpot.fit(X_train, Y_train)
print(tpot.score(X_test, Y_test))
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
etr = ExtraTreesClassifier()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
etr.fit(X_train,Y_train)
print('Accuracy on the training subset: {:.3f}'.format(etr.score(X_train, Y_train)))
print('Accuracy on the test subset: {:.3f}'.format(etr.score(X_test, Y_test)))
from sklearn.metrics import confusion_matrix
predictions = etr.predict(X_test)
print(confusion_matrix(Y_test,predictions))
Y_test
predictions
Scaling data
#Finding the minimum values for each feature
min_train = X_train.min(axis=0)
#Finding the range of each feature
range_train = (X_train - min_train).max(axis=0)
#Scaling the features into the range 0 to 1
X_train_scaled = (X_train - min_train)/range_train
print('Minimum per feature\n{}'.format(X_train_scaled.min(axis=0)))
print('Maximum per feature\n{}'.format(X_train_scaled.max(axis=0)))
X_test_scaled = (X_test - min_train)/range_train
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting. More about Extra tree Classifier
etr1 = ExtraTreesClassifier()
etr1.fit(X_train_scaled,Y_train)
print('The accuracy on the training subset: {:.3f}'.format(etr1.score(X_train_scaled, Y_train)))
print('The accuracy on the test subset: {:.3f}'.format(etr1.score(X_test_scaled, Y_test)))
s_prediction = etr1.predict(X_test_scaled)
print(confusion_matrix(Y_test,s_prediction))
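Before moving on to feature selection, note that ExtraTreesClassifier also exposes `feature_importances_`, impurity-based importances averaged over its randomized trees, which give a quick first look at which features matter. A sketch on synthetic data (an assumption, with a known number of informative features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with 3 informative features out of 10 (illustrative choice).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
etr_demo = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1 across features.
imp = etr_demo.feature_importances_
print(round(float(imp.sum()), 3))  # -> 1.0
```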
Clearly, we are facing an overfitting problem. To avoid this, we perform feature selection to keep only the important features that contribute to prediction. More about RFE
from sklearn.feature_selection import RFE
rfe = RFE(etr, n_features_to_select=10)
rfe.fit(X,Y)
rfe.ranking_
rfe.support_
param_grid = {'max_features': ['auto', 'sqrt', 'log2'],
'criterion': ['gini','entropy'], 'n_estimators':[1,5,10,15,20] }
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(ExtraTreesClassifier(),param_grid,refit=True,verbose=3)
#Fitting our model with the selected hyper-parameters and classifier
grid.fit(X_train_scaled,Y_train)
#Finding the best hyper-parameters for our model depending on the dataset
grid.best_params_
grid.best_estimator_
g_prediction = grid.predict(X_test_scaled)
print('Accuracy on the training subset: {:.3f}'.format(grid.score(X_train_scaled, Y_train)))
print('Accuracy on the test subset: {:.3f}'.format(grid.score(X_test_scaled, Y_test)))
print(confusion_matrix(Y_test,g_prediction))
from sklearn.model_selection import cross_val_score
etr_cv = ExtraTreesClassifier()
scores = cross_val_score(etr_cv, X_test, Y_test, cv = 5)
scores
scores.mean()
from sklearn.feature_selection import RFECV
# The "accuracy" scoring is proportional to the number of correct classifications
etr_rfecv = ExtraTreesClassifier()
rfecv = RFECV(estimator=etr_rfecv, step=1, cv=8, scoring='accuracy') #8-fold cross-validation
rfecv = rfecv.fit(X_train_scaled, Y_train)
print('Optimal number of features :', rfecv.n_features_)
print('Best features :', X_train_scaled.columns[rfecv.support_])
rfecv_prediction = rfecv.predict(X_test_scaled)
print('Accuracy on the training subset: {:.3f}'.format(rfecv.score(X_train_scaled, Y_train)))
print('Accuracy on the test subset: {:.3f}'.format(rfecv.score(X_test_scaled, Y_test)))
print(classification_report(Y_test,rfecv_prediction))
print(confusion_matrix(Y_test,rfecv_prediction))
Increased True Negatives : 12
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
log_roc_auc = roc_auc_score(Y_test, etr.predict(X_test))
print ("Unscaled ETR AUC: ", log_roc_auc )
fpr, tpr, thresholds = roc_curve(Y_test, etr.predict_proba(X_test)[:,1])
log_roc_auc2 = roc_auc_score(Y_test, etr1.predict(X_test_scaled))
print ("Scaled ETR AUC: ", log_roc_auc2 )
fpr2, tpr2, thresholds2 = roc_curve(Y_test, etr1.predict_proba(X_test_scaled)[:,1]) # Scaled AUC
log_roc_auc3 = roc_auc_score(Y_test, grid.predict(X_test_scaled))
print ("Scaled ETR AUC with GridSearch : ", log_roc_auc3 )
fpr3, tpr3, thresholds3 = roc_curve(Y_test, grid.predict_proba(X_test_scaled)[:,1]) # Scaled with GridSearch AUC
log_roc_auc4 = roc_auc_score(Y_test, rfecv.predict(X_test_scaled))
print ("Scaled ETR AUC with RFECV : ", log_roc_auc4 )
fpr4, tpr4, thresholds4 = roc_curve(Y_test, rfecv.predict_proba(X_test_scaled)[:,1]) # Scaled with RFECV AUC
plt.figure(figsize = (10,8))
plt.plot(fpr, tpr, label ="Unscaled (area = %0.2f)" % log_roc_auc, color ="blue")
plt.plot(fpr2, tpr2, label ="Scaled (area = %0.2f)" % log_roc_auc2, color ="red")
plt.plot(fpr3, tpr3, label ="Scaled with GRID (area = %0.2f)" % log_roc_auc3, color ="green")
plt.plot(fpr4, tpr4, label ="Scaled with RFECV (area = %0.2f)" % log_roc_auc4, color ="aquamarine")
plt.plot([0,1],[0,1], 'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Receiver operating characteristic - ExtraTreeClassifier')
plt.legend(loc="lower right")
plt.show()
The parameters considered in the experiments were as follows: (1) Train Accuracy, (2) Test Accuracy, (3) Precision, (4) Recall, (5) f1-score, and (6) ROC score.
| Parameter | Logistic Regression | Linear SVC | Extra Tree Classifier with RFECV |
|---|---|---|---|
| Train Accuracy | 0.977 | 0.877 | 0.901 |
| Test Accuracy | 0.944 | 0.817 | 0.961 |
| Precision | 0.95 | 0.81 | 0.86 |
| Recall | 0.95 | 0.82 | 0.87 |
| f1-score | 0.95 | 0.80 | 0.86 |
| roc score | 0.94 | 0.79 | 0.97 |
REFERENCES
[6] http://www.randalolson.com/2016/05/08/tpot-a-python-tool-for-automating-data-science/
[7] https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
[8] ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/
[9] https://towardsdatascience.com/machine-learning-part-3-logistics-regression-9d890928680f
The text in the document by Azadeh Bashiri, Marjan Ghazisaeedi, Reza Safdari, Leila Shahmoradi, Hamide Ehtesham, Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis, Dimitrios I. Fotiadis, Ahmad LG, Eshlaghy AT, Poorebrahimi A, Ebrahimi M, Razavi AR, Abien Fred M. Agarap, and Randal Olson is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/us/
The code in the document by Cristi Vlad is licensed under the MIT License https://opensource.org/licenses/MIT